02-Data frames

Packages

To install a package pkgName, simply type install.packages("pkgName") into the console.

To use functions in the package, we first have to load the package: library(pkgName). If the package has not been installed yet, we will get an error. Otherwise, functions in the package are now available for use.
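To avoid reinstalling a package every time a script runs, we can check whether it is already installed first. Below is a minimal sketch; is_installed is a hypothetical helper name (not a base R function) built on requireNamespace(), and the install/load lines are commented out so the snippet runs without network access.

```r
# Hypothetical helper (not part of base R): TRUE if `pkg` is installed.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # "stats" ships with R, so this should be TRUE

# Install only when needed, then load:
# if (!is_installed("fueleconomy")) install.packages("fueleconomy")
# library(fueleconomy)
```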

Data frames

Packages not only give us access to user-created functions, but also user-created datasets. In R, datasets are usually stored as data frames.

Let’s load the fueleconomy package (if you haven’t installed this package yet, run this command first: install.packages("fueleconomy")):

library(fueleconomy)

Load the vehicles dataset with the data function (to find out more about the vehicles dataset, key in ?vehicles):

data(vehicles)

An entry vehicles pops up in the Environment tab. We can see that the dataset has ~33,000 observations with 12 variables.

Let’s view the data with the View() function (note the capital V): View(vehicles). (Alternatively, we can click on “vehicles” in the Environment tab.) A new tab pops up in the top-left pane displaying the data. Clicking on the column names allows us to sort the data.

(Note: Some of you might not be able to click on “vehicles” in the Environment tab right away. Don’t worry about it: typing View(vehicles) into the console will still work, and you should be able to click on “vehicles” after that.)

Seeing parts of the data

33,000 observations is a lot of observations to look through. Instead of looking through all of it, we can use various functions to give us a feel for the data.

Use the head and tail functions to display the first few or last few rows of the dataset. To control the number of rows shown (default is 6), use the optional n argument.

head(vehicles)

##      id       make               model year                       class
## 1 27550 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General   DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General    FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5  1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6  1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
##             trans            drive cyl displ    fuel hwy cty
## 1 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 2 Automatic 3-spd    2-Wheel Drive   4   2.5 Regular  17  18
## 3 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 4 Automatic 3-spd    2-Wheel Drive   6   4.2 Regular  13  13
## 5 Automatic 3-spd Rear-Wheel Drive   4   2.5 Regular  17  16
## 6 Automatic 3-spd Rear-Wheel Drive   6   4.2 Regular  13  13

tail(vehicles, n = 2)

##          id  make                       model year       class
## 33441 33306 smart fortwo electric drive coupe 2013 Two Seaters
## 33442 34394 smart fortwo electric drive coupe 2014 Two Seaters
##                trans            drive cyl displ        fuel hwy cty
## 33441 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
## 33442 Automatic (A1) Rear-Wheel Drive  NA    NA Electricity  93 122
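The n argument also accepts negative values, which means "everything except that many rows". Here is a small sketch on a made-up data frame (toy is an assumption for illustration, not part of fueleconomy):

```r
# Toy data frame, made up for illustration
toy <- data.frame(id = 1:10, sq = (1:10)^2)

head(toy, n = 3)    # first 3 rows
tail(toy, n = 2)    # last 2 rows
head(toy, n = -8)   # all rows except the last 8, i.e. the first 2
```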

Under the hood, data frames are implemented as lists, with each column being one element in the list. Hence, whatever we can do with lists, we can do with data frames. For example, we can get the data frame’s column names using names():

names(vehicles)

##  [1] "id"    "make"  "model" "year"  "class" "trans" "drive" "cyl"  
##  [9] "displ" "fuel"  "hwy"   "cty"

To access a particular column, we can use the [[ or $ notation:

vehicles$class[1:10]

##  [1] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
##  [3] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
##  [5] "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD"
##  [7] "Midsize Cars"                "Subcompact Cars"            
##  [9] "Subcompact Cars"             "Subcompact Cars"

Since the number of columns in a data frame is just the number of elements in a list, we can get the number of columns using length():

length(vehicles)

## [1] 12

We can also use the ncol() and nrow() functions to get the number of columns and rows of the data frame:

ncol(vehicles)

## [1] 12

nrow(vehicles)

## [1] 33442
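To make the "data frame as a list of columns" idea concrete, here is a short sketch on a made-up data frame (toy and its values are assumptions for illustration). It checks that the object really is a list, pulls out a column in two equivalent ways, and applies a function to every column as we would with any list:

```r
# Toy data frame, made up for illustration
toy <- data.frame(make = c("Acura", "Audi", "BMW"),
                  cyl  = c(6L, 4L, 8L),
                  stringsAsFactors = FALSE)

is.list(toy)                      # a data frame is a list of columns
toy[["cyl"]]                      # same column as toy$cyl
length(toy)                       # number of list elements = number of columns
col_types <- sapply(toy, class)   # apply class() to each column in turn
col_types
```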

Interestingly, data frames can act a little like matrices too. For example, we can use dim() to figure out the number of rows and columns in the data frame:

dim(vehicles)

## [1] 33442    12

To access the 30th row, we can type

vehicles[30, ]

##       id  make model year        class          trans             drive
## 30 16734 Acura 3.2TL 2001 Midsize Cars Automatic (S5) Front-Wheel Drive
##    cyl displ    fuel hwy cty
## 30   6   3.2 Premium  27  17
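The same [row, column] notation works for columns and for single cells. The sketch below uses a made-up plain data frame (toy is an assumption for illustration); leaving one side of the comma empty means "take everything" along that dimension.

```r
# Toy data frame, made up for illustration
toy <- data.frame(make = c("Acura", "Audi", "BMW"),
                  cyl  = c(6L, 4L, 8L),
                  hwy  = c(27L, 30L, 25L),
                  stringsAsFactors = FALSE)

toy[2, ]                      # second row, all columns
toy[, "hwy"]                  # one column of a plain data frame, as a vector
toy[1:2, c("make", "hwy")]    # first two rows, two named columns
toy[2, "cyl"]                 # a single cell
```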

Getting an overview of the data

For an overview of the entire data set, the str function we introduced last session is very handy. For each column, str tells us what type of variable it is, as well as the first couple of values for the column.

str(vehicles)

## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

The summary function gives us some useful statistics for each variable:

summary(vehicles)

##        id            make              model                year     
##  Min.   :    1   Length:33442       Length:33442       Min.   :1984  
##  1st Qu.: 8361   Class :character   Class :character   1st Qu.:1991  
##  Median :16724   Mode  :character   Mode  :character   Median :1999  
##  Mean   :17038                                         Mean   :1999  
##  3rd Qu.:25265                                         3rd Qu.:2008  
##  Max.   :34932                                         Max.   :2015  
##                                                                      
##     class              trans              drive                cyl        
##  Length:33442       Length:33442       Length:33442       Min.   : 2.000  
##  Class :character   Class :character   Class :character   1st Qu.: 4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 5.772  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :16.000  
##                                                           NA's   :58      
##      displ           fuel                hwy              cty        
##  Min.   :0.000   Length:33442       Min.   :  9.00   Min.   :  6.00  
##  1st Qu.:2.300   Class :character   1st Qu.: 19.00   1st Qu.: 15.00  
##  Median :3.000   Mode  :character   Median : 23.00   Median : 17.00  
##  Mean   :3.353                      Mean   : 23.55   Mean   : 17.49  
##  3rd Qu.:4.300                      3rd Qu.: 27.00   3rd Qu.: 20.00  
##  Max.   :8.400                      Max.   :109.00   Max.   :138.00  
##  NA's   :57

We can also do summaries on just one column:

summary(vehicles$hwy)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.00   23.00   23.55   27.00  109.00

For just the mean or median, use the mean and median functions on the column of interest:

mean(vehicles$hwy)

## [1] 23.55128

median(vehicles$hwy)

## [1] 23

The sd() and var() functions compute the standard deviation and variance of a vector for us:

sd(vehicles$hwy)

## [1] 6.211417

var(vehicles$hwy)

## [1] 38.5817
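As a quick sanity check on how these two functions relate, the sketch below (on made-up numbers) confirms that sd() is the square root of var(), and that var() uses the sample formula with n - 1 in the denominator:

```r
x <- c(17, 17, 13, 13, 17, 13)   # made-up values

v <- var(x)
s <- sd(x)

# Sample variance computed by hand: divide by (n - 1), not n
v_manual <- sum((x - mean(x))^2) / (length(x) - 1)

all.equal(s, sqrt(v))
all.equal(v, v_manual)
```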

Note that the default types for the variables don’t always make sense. For example, does it make sense to take the mean of id numbers? To change the type of a column, use the as.x function (where x is the type you want to change to):

vehicles$id <- as.character(vehicles$id)
str(vehicles)

## Classes 'tbl_df', 'tbl' and 'data.frame':    33442 obs. of  12 variables:
##  $ id   : chr  "27550" "28426" "27549" "28425" ...
##  $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
##  $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
##  $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
##  $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
##  $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
##  $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
##  $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
##  $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
##  $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
##  $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...
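Conversions go in both directions, but they are not always lossless. The sketch below, on made-up values, converts integers to strings and back, and then shows what happens when a string cannot be read as a number: R returns NA (normally with a warning, suppressed here to keep the output quiet).

```r
ids <- c(27550L, 28426L, 1032L)   # made-up id values

ids_chr  <- as.character(ids)
class(ids_chr)

ids_back <- as.integer(ids_chr)   # round trip recovers the integers
identical(ids, ids_back)

# Strings that are not numbers become NA
mixed <- suppressWarnings(as.numeric(c("2.5", "n/a")))
mixed
```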

Factors

Look at the output of summary(vehicles) again. Note that for the character variables, summary() only reports the length, class and mode, which tells us nothing about the values themselves. One way to get information on character variables is to use the table() function:

table(vehicles$drive)

## 
##              2-Wheel Drive              4-Wheel Drive 
##                        507                        699 
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                       6647                       1267 
##          Front-Wheel Drive    Part-time 4-Wheel Drive 
##                      12233                         96 
##           Rear-Wheel Drive 
##                      11993

Another way we can get information on character variables is by converting them to factors. Factors represent categorical variables: i.e. values fall into one of several categories (e.g. gender, age group). Categories can be unordered (e.g. gender, we call them nominal variables), or ordered (e.g. age group, we call them ordinal variables).

We can make a character variable into a factor variable by using factor(). Notice now that summary() gives more useful information. (By default, factor variables are nominal variables.)

vehicles$drive <- factor(vehicles$drive)
summary(vehicles$drive)

##              2-Wheel Drive              4-Wheel Drive 
##                        507                        699 
## 4-Wheel or All-Wheel Drive            All-Wheel Drive 
##                       6647                       1267 
##          Front-Wheel Drive    Part-time 4-Wheel Drive 
##                      12233                         96 
##           Rear-Wheel Drive 
##                      11993

Let’s look at the internal structure of the factor variable:

str(vehicles$drive)

##  Factor w/ 7 levels "2-Wheel Drive",..: 1 1 1 1 7 7 7 5 5 5 ...

Notice that the words (“2-Wheel Drive”, etc.) have been changed into numbers! That’s because R assigns each category a number. We can see this assignment by calling levels(), which shows us the “levels”, or categories, for this variable in the order of their numbers:

levels(vehicles$drive)

## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Part-time 4-Wheel Drive"   
## [7] "Rear-Wheel Drive"

So 2-Wheel Drives are labeled 1, and so on. By default, R assigns this internal labeling by alphabetical order. This internal labeling is usually not a concern to us. See the optional material section for more details.
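If we are curious, we can pull out the internal numbers with as.integer() and map them back to the labels through levels(). A small sketch on a made-up vector (the values are chosen for illustration only):

```r
drive <- c("Rear-Wheel Drive", "Front-Wheel Drive",
           "Rear-Wheel Drive", "2-Wheel Drive")   # made-up values
f <- factor(drive)

levels(f)                 # the categories, in sorted order
codes <- as.integer(f)    # the internal number for each observation
codes

levels(f)[codes]          # looking the numbers up recovers the labels
```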

Working with NAs

Let’s compute the mean number of cylinders in our dataset:

mean(vehicles$cyl)

## [1] NA

Hmm, we get an NA? What’s happening? If we look through the cyl column, we’ll find that some of the entries are NA. The documentation for the mean function (?mean) shows that there is an na.rm option, with default value FALSE. This means that by default, mean will not remove any NAs that it sees, and will return NA if any one of the elements is NA.

We can get the mean as follows:

mean(vehicles$cyl, na.rm = TRUE)

## [1] 5.771867

Working with NAs can be tricky sometimes because they don’t always show up. For example, the output of table doesn’t show you the NAs, which could mislead you into thinking that there are no NAs in the column:

table(vehicles$cyl)

## 
##     2     3     4     5     6     8    10    12    16 
##    45   182 12381   718 11885  7550   138   478     7

The summary function does tell us though if there are NAs in a column:

summary(vehicles$cyl)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   4.000   6.000   5.772   6.000  16.000      58
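We can also ask table() to report NAs through its useNA argument. A minimal sketch on a made-up vector:

```r
cyl <- c(4, 6, NA, 4, 8, NA)   # made-up values

table(cyl)                          # NAs are silently left out
tab <- table(cyl, useNA = "ifany")  # adds an <NA> column when NAs exist
tab
```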

To test if something is an NA or not, use the is.na function.

is.na(NA)

## [1] TRUE
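is.na() works element by element on vectors, which makes it handy for counting or dropping missing values. Note that comparing with == NA does not work, since any comparison with NA is itself NA. A short sketch on made-up values:

```r
cyl <- c(4, 6, NA, 4, 8, NA)   # made-up values

is.na(cyl)                     # one TRUE/FALSE per element
n_missing <- sum(is.na(cyl))   # TRUE counts as 1, so this counts the NAs
n_missing

cyl == NA                      # all NA: this is why is.na() exists

mean(cyl[!is.na(cyl)])         # same result as mean(cyl, na.rm = TRUE)
```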

Filtering (the old way)

What if I just want to look at observations which have more than 8 cylinders? To do that, we first need to know another way of extracting elements from a vector. Consider the vector below:

vec <- 1:3

To extract a group of elements from vec, we previously used square bracket notation, with a vector of indices that we wanted to extract:

vec[c(1,2)]

## [1] 1 2

Another way to extract elements is by putting a logical vector of the same length in the square brackets. R will then extract the elements in the positions where the logical vector is TRUE. For example, the code below extracts the first and third elements:

vec[c(TRUE, FALSE, TRUE)]

## [1] 1 3
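In practice we rarely type the TRUEs and FALSEs by hand. A comparison such as vec > 6 produces a logical vector of the right length for us. A small sketch with made-up numbers:

```r
vec <- c(4, 8, 12, 6)   # made-up values

keep <- vec > 6         # comparison is done element by element
keep

vec[keep]                    # elements where keep is TRUE
vec[vec > 6 & vec < 12]      # conditions can be combined with & (and) or | (or)
```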

To extract all the observations with more than 8 cylinders, we can do this:

df <- vehicles[vehicles$cyl > 8, ]
table(df$cyl)

## 
##  10  12  16 
## 138 478   7

To extract observations with exactly 8 cylinders (notice the double equal sign):

df <- vehicles[vehicles$cyl == 8, ]
table(df$cyl)

## 
##    8 
## 7550

To extract observations such that the number of cylinders is not 8:

df <- vehicles[vehicles$cyl != 8, ]
table(df$cyl)

## 
##     2     3     4     5     6    10    12    16 
##    45   182 12381   718 11885   138   478     7
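One caveat when the column contains NAs: a comparison with NA gives NA, and indexing a plain data frame with an NA in the row position returns a row filled with NAs. Since table() hides NAs by default, these extra rows are easy to miss. The sketch below shows the effect on a made-up data frame (toy is an assumption for illustration) and one common workaround, wrapping the condition in which(), which keeps only the positions that are TRUE:

```r
# Toy data frame with a missing value, made up for illustration
toy <- data.frame(model = c("a", "b", "c", "d"),
                  cyl   = c(12, NA, 4, 10),
                  stringsAsFactors = FALSE)

naive <- toy[toy$cyl > 8, ]
naive                # three rows: two real matches plus a row of NAs
nrow(naive)

safe <- toy[which(toy$cyl > 8), ]   # which() drops the NA positions
safe
nrow(safe)
```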

This is the “old” way of filtering datasets. (Next week, we’ll talk about a newer way to do filtering and other data transformations.)

Optional material

Viewing a random sample of the dataset

Instead of just the first or last few rows, we may want to view a random sample of rows from the data frame. We can do this by composing functions that we already know with sample():

vehicles[sample(nrow(vehicles), 5), ]

##          id      make           model year                      class
## 10392 14073     Eagle           Talon 1998            Subcompact Cars
## 16719  9932   Hyundai         Elantra 1993               Compact Cars
## 25881   362  Plymouth         Horizon 1985               Compact Cars
## 11913 33187      Ford F150 Pickup 2WD 2013 Standard Pickup Trucks 2WD
## 4923   8563 Chevrolet        Corvette 1992                Two Seaters
##                 trans             drive cyl displ    fuel hwy cty
## 10392    Manual 5-spd Front-Wheel Drive   4   2.0 Regular  30  20
## 16719 Automatic 4-spd Front-Wheel Drive   4   1.8 Regular  26  20
## 25881    Manual 5-spd Front-Wheel Drive   4   2.2 Regular  33  22
## 11913 Automatic 6-spd  Rear-Wheel Drive   6   3.5 Regular  22  16
## 4923     Manual 6-spd  Rear-Wheel Drive   8   5.7 Premium  23  15
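Because sample() is random, we get different rows each time we run it. If we want the same "random" rows every time (for example, so that others can reproduce our output), we can call set.seed() first. A minimal sketch on a made-up data frame; the seed value 42 is arbitrary:

```r
# Toy data frame, made up for illustration
toy <- data.frame(id = 1:100, value = (1:100)^2)

set.seed(42)                    # any fixed number works
rows1 <- sample(nrow(toy), 5)

set.seed(42)                    # resetting the seed repeats the draw
rows2 <- sample(nrow(toy), 5)

identical(rows1, rows2)
toy[rows1, ]
```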

Computing the mode of a column

R doesn’t have a built-in function to compute the statistical mode (the function named mode() reports how an object is stored, which is something else entirely). We can either write our own function (many people have shared versions online), or we can use some other functions which allow us to figure out what the mode is.

First, the table function tells us how many times each value appeared in the column:

table(vehicles$hwy)

## 
##    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23 
##   13   66   62  275  295  453  847 1257 2094 1547 1605 2314 1400 2672 2383 
##   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38 
## 2788 1944 2712 1558 1448 1371  846  799  528  515  358  313  205  125  106 
##   39   40   41   42   43   44   45   46   47   48   49   50   51   52   53 
##  125   79   56   46   20   52   55    9   10    8   14    2    4    7    1 
##   54   58   59   60   61   62   64   65   68   69   74   79   90   92   93 
##    3    4    2    1    1    2    3    2    2    2    3    2    3    2    4 
##   96   97   99  101  102  105  108  109 
##    2    2    6    2    1    3    2    1

To find out which number appeared most often, we have to visually scan the whole table. We could sort the table to help us:

sort(table(vehicles$hwy))

## 
##   53   60   61  102  109   50   59   62   65   68   69   79   92   96   97 
##    1    1    1    1    1    2    2    2    2    2    2    2    2    2    2 
##  101  108   54   64   74   90  105   51   58   93   99   52   48   46   47 
##    2    2    3    3    3    3    3    4    4    4    6    7    8    9   10 
##    9   49   43   42   44   45   41   11   10   40   38   37   39   36   12 
##   13   14   20   46   52   55   56   62   66   79  106  125  125  205  275 
##   13   35   34   14   33   32   31   30   15   16   29   21   28   18   27 
##  295  313  358  453  515  528  799  846  847 1257 1371 1400 1448 1547 1558 
##   19   25   17   20   23   22   26   24 
## 1605 1944 2094 2314 2383 2672 2712 2788

The mode is the last entry (24, appearing 2788 times). To have the mode appear in front, add a decreasing = TRUE argument to the function call:

sort(table(vehicles$hwy), decreasing = TRUE)

## 
##   24   26   22   23   20   17   25   19   27   18   28   21   29   16   15 
## 2788 2712 2672 2383 2314 2094 1944 1605 1558 1547 1448 1400 1371 1257  847 
##   30   31   32   33   14   34   35   13   12   36   37   39   38   40   10 
##  846  799  528  515  453  358  313  295  275  205  125  125  106   79   66 
##   11   41   45   44   42   43   49    9   47   46   48   52   99   51   58 
##   62   56   55   52   46   20   14   13   10    9    8    7    6    4    4 
##   93   54   64   74   90  105   50   59   62   65   68   69   79   92   96 
##    4    3    3    3    3    3    2    2    2    2    2    2    2    2    2 
##   97  101  108   53   60   61  102  109 
##    2    2    2    1    1    1    1    1
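If we need the mode often, the steps above can be wrapped into a small function. This is just one possible sketch: stat_mode is a hypothetical name, it returns the value(s) as character strings (because they come from the names of the table), and it returns all tied values when there is more than one mode.

```r
# Hypothetical helper: most frequent value(s) in x, as character strings
stat_mode <- function(x) {
  counts <- table(x)                     # how often each value appears
  names(counts)[counts == max(counts)]   # keep the value(s) with the top count
}

hwy <- c(24, 26, 24, 22, 26, 24)   # made-up values
stat_mode(hwy)

stat_mode(c("a", "b", "b", "a"))   # a tie: both values are returned
```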

More on factors

By default, when we make a variable a factor, R assigns an internal labeling by alphabetical order. This usually doesn’t concern us. One instance where we might want to have more control over the ordering is when we plot the data: for a bar plot, the category labeled 1 goes on the left-most end, followed by 2, etc.

barplot(table(vehicles$drive))

If we want to, we can set the order ourselves by specifying a levels argument. Let’s flip the labeling:

vehicles$drive <- factor(vehicles$drive, 
                         levels = sort(unique(vehicles$drive), decreasing = TRUE))
levels(vehicles$drive)

## [1] "Rear-Wheel Drive"           "Part-time 4-Wheel Drive"   
## [3] "Front-Wheel Drive"          "All-Wheel Drive"           
## [5] "4-Wheel or All-Wheel Drive" "4-Wheel Drive"             
## [7] "2-Wheel Drive"

Note how the barplot is now “flipped”:

barplot(table(vehicles$drive))
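When the variable is already a factor, a shorter way to flip the order is to reverse its existing levels with rev(). The sketch below uses a made-up factor; note that only the ordering of the levels changes, not the data values themselves.

```r
drive <- factor(c("Rear-Wheel Drive", "Front-Wheel Drive",
                  "Rear-Wheel Drive"))   # made-up values

flipped <- factor(drive, levels = rev(levels(drive)))

levels(drive)
levels(flipped)      # same categories, reversed order
table(flipped)       # counts follow the new level order

# The underlying labels for each observation are unchanged
identical(as.character(flipped), as.character(drive))
```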

For ordinal variables, we need to add an ordered = TRUE argument to factor():

vehicles$drive <- as.character(vehicles$drive)
vehicles$drive <- factor(vehicles$drive, ordered = TRUE)
str(vehicles$drive)

##  Ord.factor w/ 7 levels "2-Wheel Drive"<..: 1 1 1 1 7 7 7 5 5 5 ...

levels(vehicles$drive)

## [1] "2-Wheel Drive"              "4-Wheel Drive"             
## [3] "4-Wheel or All-Wheel Drive" "All-Wheel Drive"           
## [5] "Front-Wheel Drive"          "Part-time 4-Wheel Drive"   
## [7] "Rear-Wheel Drive"

head(vehicles$drive)

## [1] 2-Wheel Drive    2-Wheel Drive    2-Wheel Drive    2-Wheel Drive   
## [5] Rear-Wheel Drive Rear-Wheel Drive
## 7 Levels: 2-Wheel Drive < ... < Rear-Wheel Drive
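The drive types above have no natural ordering, so alphabetical order is as good as any. For a variable that truly is ordinal, we would normally give the levels explicitly, from smallest to largest. The sketch below uses a made-up size variable (not from the vehicles data) and shows that ordered factors support comparisons:

```r
sizes <- c("medium", "small", "large", "small")   # made-up values

f <- factor(sizes,
            levels  = c("small", "medium", "large"),   # smallest to largest
            ordered = TRUE)
f

f < "large"   # element-wise comparison using the level order
max(f)        # the largest category present
```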

Session info

This section is for documentation purposes: By displaying my session info, others who read this document will know what the system set-up was when I ran the commands above.

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fueleconomy_0.1
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.1  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2
##  [5] tools_3.5.1     htmltools_0.3.6 yaml_2.1.19     Rcpp_0.12.17   
##  [9] stringi_1.2.3   rmarkdown_1.10  knitr_1.20      stringr_1.3.1  
## [13] digest_0.6.15   evaluate_0.10.1